3. Explore the data.
I. First, I would like to see the chemical elements response to the
It is surprising that the first three graphs show similar trends: the
Black group appears to have the highest A500, A650, and PTCA. However,
the differences are not very large, especially in the first two graphs,
where the scales are small. The red group tends to have more H_4AHP,
while all the other groups have fairly small values of H_4AHP, and this
difference seems to be large.
II. Next, I would like to check the influence of ancestry.
First, I see some zero values in the column, which I remove since they
are meaningless.
Then I make plots to see the influence.
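The zero-removal step can be sketched as follows. This is only a minimal illustration: the data frame `x_demo`, its values, and the column name `ancestry` are made up for the example, and it assumes the zeros sit in the ancestry column.

```r
# Synthetic stand-in for the data; all values are invented for illustration.
x_demo <- data.frame(
  ancestry = c(1, 2, 0, 3, 2, 0, 6),
  A500     = c(0.12, 0.30, 0.25, 0.08, 0.41, 0.19, 0.10)
)

# Keep only the rows whose ancestry value is not zero.
x_clean <- x_demo[x_demo$ancestry != 0, ]

n_before <- nrow(x_demo)   # number of rows before filtering
n_after  <- nrow(x_clean)  # number of rows after filtering

# Median A500 within each remaining ancestry value.
med_by_ancestry <- tapply(x_clean$A500, x_clean$ancestry, median)
print(med_by_ancestry)
```

A call such as `boxplot(A500 ~ ancestry, data = x_clean)` would then draw one box per ancestry value.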

I don’t see an outstanding pattern in the relationship between chemicals
and ancestry.
In the graphs of A500, A650, and PTCA, most points fall on ancestry 1 and
2, which also have the largest ranges. Ancestry values 1, 3, and 6 tend
to have larger chemical values. In the graph of H_4AHP, however, the mode
of the data clearly falls on ancestry 2, and the highest value and the
median also fall there.
III. Let’s further explore whether there are relationships among the
chemicals.
We can see that those three graphs display a clear positive linear
relationship. Thus, I expect a simple linear regression model to be
suitable for each pair.
From the results, we see that the p-values for the first and second
models are very small and the adjusted R-squared values are higher than
0.9. For the third model, however, the p-value for the intercept is
large, even though its adjusted R-squared is also high. Thus, it needs a
further check.
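As a rough sketch of how such models can be fitted and read, the code below uses synthetic data in place of the real data frame. The generating coefficients and noise levels are invented, so the printed numbers will not match the results discussed here.

```r
set.seed(1)

# Synthetic data with the same column names; all values are invented.
n    <- 133
PTCA <- runif(n, 20, 400)
A650 <- 0.00027 * PTCA + rnorm(n, sd = 0.008)
A500 <- 2.9 * A650 + 0.015 + rnorm(n, sd = 0.009)
x_demo <- data.frame(A500, A650, PTCA)

# The three simple linear regressions, one per pair of chemicals.
m1 <- lm(A500 ~ A650, data = x_demo)
m2 <- lm(A500 ~ PTCA, data = x_demo)
m3 <- lm(A650 ~ PTCA, data = x_demo)

# Coefficient table: estimate, standard error, t value, p-value.
coef_m1 <- coef(summary(m1))
print(coef_m1)

# Adjusted R-squared of each model.
adj_r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
                 function(m) summary(m)$adj.r.squared)
print(adj_r2)
```

The row `(Intercept)` of the coefficient table holds the intercept's p-value, which is the quantity that looks doubtful for the third model.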
Let’s see the confidence intervals.
## 2.5 % 97.5 %
## (Intercept) 0.01212621 0.01751386
## A650 2.86583262 2.96391879
## 2.5 % 97.5 %
## (Intercept) 0.0032014187 0.0183571045
## PTCA 0.0007571028 0.0008314545
## 2.5 % 97.5 %
## (Intercept) -0.0038011631 0.0010243396
## PTCA 0.0002606674 0.0002843407
From the confidence intervals of the three models, none of the intervals
for models 1 and 2 contains zero, which also indicates that there is a
relationship. For model 3, however, the interval for the intercept
contains zero (the interval for the slope does not), so the result is
less ideal.
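The check for zero inside each interval can be sketched with `confint()`. The data below are synthetic and generated with a true intercept of zero, purely as an assumption for the illustration.

```r
set.seed(2)

# Synthetic data; the true intercept is set to 0 on purpose.
PTCA <- runif(133, 20, 400)
A650 <- 0.00027 * PTCA + rnorm(133, sd = 0.008)

m3 <- lm(A650 ~ PTCA)

# 95% confidence intervals for the intercept and the slope.
ci <- confint(m3, level = 0.95)
print(ci)

# TRUE where the interval covers zero, i.e. the coefficient is not
# significantly different from zero at the 5% level.
contains_zero <- ci[, 1] <= 0 & ci[, 2] >= 0
print(contains_zero)
```

An interval that covers zero corresponds to a coefficient whose t-test p-value is above 0.05, which is why the intercept of the third model stands out.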
Let’s further check with a correlation test.
##
## Pearson's product-moment correlation
##
## data: x$A650 and x$A500
## t = 117.58, df = 131, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9933715 0.9966618
## sample estimates:
## cor
## 0.9952954
##
## Pearson's product-moment correlation
##
## data: x$PTCA and x$A500
## t = 42.266, df = 131, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9513183 0.9752239
## sample estimates:
## cor
## 0.9652351
##
## Pearson's product-moment correlation
##
## data: x$PTCA and x$A650
## t = 45.543, df = 131, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9577306 0.9785220
## sample estimates:
## cor
## 0.9698426
From the results, however, we see that all three p-values for the tests
are very small, and the correlations are large. They are all over 0.9,
which indicates a strong linear relationship.
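The t statistic reported by the Pearson test follows directly from the correlation and the degrees of freedom, t = r * sqrt(df / (1 - r^2)). The short sketch below recomputes it from the rounded values printed above for A650 and A500, and then checks the same identity against `cor.test()` on small synthetic vectors (`u` and `v` are invented for the illustration).

```r
# Values copied from the first correlation test output.
r_src  <- 0.9952954
df_src <- 131

# t statistic implied by r and df; it should be close to the printed 117.58.
t_src <- r_src * sqrt(df_src / (1 - r_src^2))
print(t_src)

# Same identity checked on synthetic data.
set.seed(4)
u  <- rnorm(30)
v  <- 0.8 * u + rnorm(30, sd = 0.5)
ct <- cor.test(u, v)

r_hat    <- unname(ct$estimate)
t_manual <- r_hat * sqrt(unname(ct$parameter) / (1 - r_hat^2))
print(all.equal(t_manual, unname(ct$statistic)))
```

Because r is so close to 1 here, the denominator 1 - r^2 is tiny, which is what makes the t values so large and the p-values so small.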
Let’s further test whether the slopes are significant.
## Analysis of Variance Table
##
## Response: A500
## Df Sum Sq Mean Sq F value Pr(>F)
## A650 1 1.11570 1.11570 13824 < 2.2e-16 ***
## Residuals 131 0.01057 0.00008
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: A500
## Df Sum Sq Mean Sq F value Pr(>F)
## PTCA 1 1.04932 1.04932 1786.4 < 2.2e-16 ***
## Residuals 131 0.07695 0.00059
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: A650
## Df Sum Sq Mean Sq F value Pr(>F)
## PTCA 1 0.123512 0.12351 2074.2 < 2.2e-16 ***
## Residuals 131 0.007801 0.00006
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show that the p-values are very small for all three models,
so the slopes seem to be significant.
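For a regression with a single predictor, the F value in the ANOVA table is the square of the slope's t value, so this test repeats the slope test in another form. A minimal sketch on synthetic data (the generating values are invented):

```r
set.seed(5)

# Synthetic data; values are invented for illustration.
A650 <- runif(60, 0.01, 0.10)
A500 <- 2.9 * A650 + 0.015 + rnorm(60, sd = 0.009)

m <- lm(A500 ~ A650)

# F value from the ANOVA table and t value of the slope.
f_value <- anova(m)[["F value"]][1]
t_value <- coef(summary(m))["A650", "t value"]

print(c(F = f_value, t_squared = t_value^2))

# The printed outputs above agree with this as well: the t value 42.266
# from the correlation test squares to roughly the F value 1786.4.
print(42.266^2)
```

This also explains why the F tests and the correlation tests report the same p-values for each pair.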
From the graphs, we can see that the residuals have roughly constant
spread, but the normal Q-Q plot for the first model departs noticeably
from a straight line. The second and third models seem to meet the
normality requirement.
Let’s see the outliers.
## 135 124 89 77 9 116
## -1.844912277 -1.822951089 -1.653918283 -1.585726276 -1.421650077 -1.353602868
## 75 94 70 50 71 92
## -1.259928575 -1.225690513 -1.196502352 -1.109573500 -1.065230889 -1.060227079
## 83 121 24 20 134 126
## -1.011479228 -0.997056037 -0.973809214 -0.973768404 -0.967662993 -0.943770511
## 16 1 90 54 72 8
## -0.936739657 -0.902309500 -0.880118444 -0.857264151 -0.819057254 -0.806161841
## 93 79 68 74 43 125
## -0.797196338 -0.774213408 -0.764635675 -0.761948914 -0.733472800 -0.718913725
## 37 27 52 34 127 46
## -0.676647293 -0.646581580 -0.640713586 -0.628263714 -0.626025547 -0.593279500
## 44 76 5 36 53 123
## -0.555394742 -0.547982431 -0.541080399 -0.538477076 -0.536468431 -0.528738830
## 114 58 60 91 35 86
## -0.509396182 -0.505996783 -0.496392490 -0.486685491 -0.481518121 -0.480445010
## 40 49 120 69 105 29
## -0.450179192 -0.421868082 -0.397272703 -0.388742495 -0.358784040 -0.358358279
## 112 95 84 48 57 108
## -0.339580491 -0.333290797 -0.329988111 -0.320475417 -0.311002345 -0.285149223
## 106 33 56 31 80 18
## -0.275541881 -0.262101482 -0.254127211 -0.246435411 -0.234354840 -0.216159265
## 45 14 63 32 21 98
## -0.216159265 -0.168616613 -0.165424890 -0.153520105 -0.141952811 -0.137547548
## 6 10 4 64 38 82
## -0.125005830 -0.125005830 -0.065708983 -0.063565654 -0.048832740 -0.041794641
## 109 22 19 30 12 117
## 0.003503465 0.035360020 0.037134281 0.039227175 0.065897818 0.068085028
## 111 47 3 23 115 99
## 0.070278533 0.097608787 0.130313517 0.137093768 0.146382449 0.163344694
## 107 119 100 110 96 39
## 0.172849661 0.190025428 0.209036845 0.247015608 0.275468173 0.290890430
## 81 118 15 55 73 42
## 0.332318245 0.340297726 0.380371905 0.393683672 0.396341749 0.450765050
## 25 2 17 97 7 28
## 0.463538915 0.469891557 0.520635989 0.537427105 0.604513154 0.652980633
## 128 113 67 122 102 51
## 0.733316062 0.817389037 0.836631528 0.845779892 0.892310810 0.907550475
## 13 61 66 78 41 104
## 0.940725027 1.156636909 1.204139989 1.256125829 1.458267161 1.593482510
## 130 129 103 132 62 65
## 1.642082533 1.723935989 1.751376147 1.763539507 1.777997195 1.860473027
## 101 131 85 133 26 59
## 1.921046641 1.988327233 2.114915979 2.127563333 3.279715404 3.324793379
## 11
## 3.662443365
## 58 52 30 90 76 46
## -3.112080075 -2.206717007 -1.899460644 -1.889051385 -1.885828497 -1.782682303
## 28 92 57 74 35 54
## -1.538627223 -1.285520291 -1.284652057 -1.284020249 -1.213103281 -1.144329965
## 38 91 50 45 56 116
## -1.142006352 -1.084778865 -1.073700148 -1.063732958 -1.032595565 -1.025983382
## 70 49 115 26 75 15
## -1.008517942 -0.963252533 -0.905843027 -0.861397321 -0.855853364 -0.851337392
## 31 36 43 124 48 93
## -0.840574229 -0.838956778 -0.812325798 -0.742799488 -0.714027783 -0.707865313
## 32 83 55 94 53 68
## -0.684898598 -0.656490964 -0.627463247 -0.621580273 -0.607103372 -0.594925127
## 4 72 37 89 51 106
## -0.554222190 -0.548642878 -0.527457785 -0.522120642 -0.522013329 -0.494459170
## 33 5 82 114 102 44
## -0.494450824 -0.446280470 -0.431842668 -0.402154993 -0.398856818 -0.394604169
## 126 73 110 112 29 9
## -0.391759683 -0.337929735 -0.260946771 -0.255954216 -0.254932977 -0.254371623
## 24 71 69 62 98 108
## -0.251573794 -0.234410607 -0.201705084 -0.199256883 -0.194016401 -0.193680846
## 81 40 95 86 34 118
## -0.158531129 -0.144608656 -0.143880151 -0.140366776 -0.068366847 -0.062723285
## 134 47 119 6 125 128
## -0.062723285 -0.016769751 -0.015389325 -0.014479099 -0.003779126 0.019066777
## 80 121 117 84 105 10
## 0.029080374 0.045362169 0.056264674 0.059794054 0.062109640 0.084387896
## 67 111 127 107 109 122
## 0.085235978 0.157268116 0.165577104 0.196334945 0.199299221 0.229667946
## 100 12 97 25 129 16
## 0.254485408 0.268243100 0.278924534 0.335871014 0.347173226 0.375823983
## 27 120 135 96 99 79
## 0.377965685 0.448782464 0.462833073 0.476773746 0.481483332 0.498754252
## 59 19 42 113 104 123
## 0.548810458 0.587378128 0.636536460 0.650284022 0.651224522 0.717166601
## 132 101 17 14 130 131
## 0.740388894 0.796906584 0.804054099 0.819826286 0.869041923 0.873545652
## 85 7 3 21 1 13
## 0.882628641 0.882685053 0.907129748 0.991456027 1.020150980 1.040105165
## 61 2 39 133 77 78
## 1.050767632 1.082428537 1.111093503 1.194122113 1.198144094 1.233885393
## 8 103 18 66 60 23
## 1.234275208 1.247358082 1.250640421 1.259253043 1.287951849 1.356165158
## 65 22 41 63 64 11
## 1.379073108 1.609781849 1.650247748 1.792496946 1.962663605 2.745536387
## 20
## 4.445738776
## 58 26 52 30 28 76
## -3.14931902 -2.24048971 -2.12155445 -2.06245930 -1.92028344 -1.81299185
## 90 46 57 38 35 74
## -1.68440123 -1.68363033 -1.25987843 -1.21096649 -1.11472246 -1.07913604
## 15 45 115 56 91 92
## -1.06980126 -1.05983734 -1.03437220 -1.01109742 -0.97454080 -0.96211754
## 62 51 54 49 55 31
## -0.92613257 -0.92541763 -0.89069283 -0.86923229 -0.83403851 -0.80731836
## 102 59 50 36 32 48
## -0.78549662 -0.73657135 -0.71386135 -0.68890309 -0.67674023 -0.64138100
## 70 43 4 116 73 82
## -0.60897536 -0.58230574 -0.57114378 -0.56515341 -0.52266106 -0.44847379
## 93 53 33 106 75 110
## -0.44434992 -0.43988691 -0.42819080 -0.42262622 -0.41904343 -0.37959341
## 68 129 81 83 37 128
## -0.33563815 -0.31403732 -0.30337118 -0.30337118 -0.29809193 -0.27202153
## 5 72 67 114 44 118
## -0.26473680 -0.26403300 -0.24214677 -0.22973765 -0.20339807 -0.20327497
## 94 98 112 29 108 119
## -0.18014204 -0.15389184 -0.14003158 -0.13165326 -0.09461969 -0.09225563
## 122 124 69 47 126 95
## -0.09015149 -0.07192120 -0.06210720 -0.05744389 -0.04483349 -0.02166985
## 40 117 6 86 104 97
## 0.02381844 0.03366341 0.03419808 0.04085643 0.06520576 0.08598825
## 101 132 89 85 24 80
## 0.09124434 0.09322785 0.09827647 0.10622111 0.11770089 0.12503214
## 10 111 107 131 71 25
## 0.14072406 0.14154960 0.14266348 0.14676469 0.17310953 0.17664193
## 34 100 84 105 109 12
## 0.17707122 0.19088469 0.19642704 0.21046074 0.21356481 0.26247399
## 130 125 9 134 113 96
## 0.28006575 0.28363918 0.29352593 0.31944032 0.37408546 0.40378625
## 127 133 121 99 42 19
## 0.42911927 0.43600385 0.44783816 0.45368361 0.50583059 0.61787082
## 120 103 17 27 61 7
## 0.64268250 0.64372563 0.65841948 0.66500385 0.67026887 0.70973457
## 65 13 16 78 79 66
## 0.74328760 0.74521796 0.77873741 0.82728833 0.84696913 0.87639108
## 3 14 2 123 39 21
## 0.92527950 0.95037173 0.97870302 0.98467743 1.08105699 1.12474084
## 41 135 23 18 1 11
## 1.19670350 1.23787257 1.40649814 1.43352370 1.45781825 1.50072784
## 60 8 22 77 63 64
## 1.58494613 1.65056323 1.72011257 1.92616535 1.99675146 2.13978787
## 20
## 5.17230102
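For the first model, 5 points (85, 133, 26, 59, 11) have residuals beyond
2 standard deviations. For the second and third models, 2 points (58 and
20) lie beyond 3 standard deviations; with the same cutoff of 2, the
second model has 4 such points (58, 52, 11, 20) and the third has 6 (58,
26, 52, 30, 64, 20).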
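One way to flag such points is with standardized residuals. The sketch below assumes `rstandard()` and a cutoff of 2, which may differ from what was actually run. The data are synthetic and two outliers are planted on purpose.

```r
set.seed(3)

# Synthetic data; values are invented for illustration.
A650 <- runif(50, 0.01, 0.10)
A500 <- 2.9 * A650 + 0.015 + rnorm(50, sd = 0.005)

# Plant two artificial outliers at rows 7 and 20.
A500[c(7, 20)] <- A500[c(7, 20)] + 0.05

m <- lm(A500 ~ A650)

# Standardized residuals, sorted from smallest to largest.
r <- rstandard(m)
print(sort(r))

# Rows whose standardized residual exceeds 2 in absolute value.
flagged <- which(abs(r) > 2)
print(flagged)

# Refit without the flagged rows.
keep    <- setdiff(seq_along(A500), flagged)
m_clean <- lm(A500[keep] ~ A650[keep])
```

Using one cutoff for every model keeps the outlier counts comparable across the three fits.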
Now the graphs look much better, especially regarding the normality
assumption for model 1. Moreover, the p-value for model 3 is much smaller
than before, even though the residuals are not as ideal as I expected.
Let’s see the other variables.

##
## Call:
## lm(formula = A500 ~ H_4AHP, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.134575 -0.083009 -0.007986 0.075965 0.211277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1564502 0.0085648 18.267 < 2e-16 ***
## H_4AHP -0.0004306 0.0001484 -2.902 0.00435 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08988 on 131 degrees of freedom
## Multiple R-squared: 0.0604, Adjusted R-squared: 0.05322
## F-statistic: 8.421 on 1 and 131 DF, p-value: 0.004354

##
## Call:
## lm(formula = A650 ~ H_4AHP, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.044536 -0.027118 -0.001468 0.024998 0.074561
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.903e-02 2.898e-03 16.919 <2e-16 ***
## H_4AHP -1.663e-04 5.021e-05 -3.312 0.0012 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03041 on 131 degrees of freedom
## Multiple R-squared: 0.07725, Adjusted R-squared: 0.0702
## F-statistic: 10.97 on 1 and 131 DF, p-value: 0.001199

##
## Call:
## lm(formula = PTCA ~ H_4AHP, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -175.566 -99.990 -0.532 85.921 237.367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 185.2349 10.2741 18.029 < 2e-16 ***
## H_4AHP -0.6187 0.1780 -3.476 0.00069 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 107.8 on 131 degrees of freedom
## Multiple R-squared: 0.08445, Adjusted R-squared: 0.07746
## F-statistic: 12.08 on 1 and 131 DF, p-value: 0.0006905
The shapes of those three graphs are very interesting: they are strongly
skewed to the right. From the summary tables, we can also see that H_4AHP
does not have a strong linear relationship with A500, A650, or PTCA.
Although the slopes are statistically significant, the adjusted R-squared
values are very small (about 0.05 to 0.08).
Conclusion
In conclusion, I first explore the relationship between races and
chemicals. From the graphs, I find that the Black group has the highest
A500, A650, and PTCA. The red group tends to have more H_4AHP, and the
difference also seems to be large. Then I explore the relationship
between ancestry and chemicals. From the data, I discover that most
values of A500, A650, and PTCA fall on ancestry 1 or 2, and that ancestry
values 1, 3, and 6 tend to have larger chemical values. For H_4AHP,
however, most of the data fall on ancestry 2, and the highest value and
the median also fall there. Lastly, I explore the relationships among the
chemicals to see whether they are related to each other. It is clear that
A500, A650, and PTCA have positive linear relationships with one another.
We can also make predictions with the following models.
A500^0.8 = A650 * 3.437719 + 0.052589
sqrt(A500) = PTCA * 1.086e-03 + 1.762e-01
A650^0.6 = sqrt(PTCA) * 0.0140647 - 0.0260570
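Since the responses in these models are transformed, a prediction has to be transformed back to the original scale. A minimal sketch using the coefficients above; the function names and the input values are hypothetical, chosen only for illustration.

```r
# Invert A500^0.8 = 3.437719 * A650 + 0.052589
predict_A500_from_A650 <- function(A650) {
  (3.437719 * A650 + 0.052589)^(1 / 0.8)
}

# Invert sqrt(A500) = 1.086e-03 * PTCA + 1.762e-01
predict_A500_from_PTCA <- function(PTCA) {
  (1.086e-03 * PTCA + 1.762e-01)^2
}

# Invert A650^0.6 = 0.0140647 * sqrt(PTCA) - 0.0260570
# The right-hand side is negative for very small PTCA (below about 3.4),
# in which case the power is undefined and R returns NaN.
predict_A650_from_PTCA <- function(PTCA) {
  (0.0140647 * sqrt(PTCA) - 0.0260570)^(1 / 0.6)
}

# Hypothetical inputs, not taken from the data.
print(predict_A500_from_A650(0.05))
print(predict_A500_from_PTCA(100))
print(predict_A650_from_PTCA(100))
```

These back-transformed values are point predictions only and should be used within the range of the observed data.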